Abstract

We propose employing techniques from Natural Language
Processing (NLP) and Knowledge Representation (KR) to
transform existing documents into documents amenable to
the Semantic Web. Semantic Web documents have at least
part of their semantics and pragmatics marked up explicitly
in a manner that is both machine processable and human
readable. XML and its related standards (XSLT, RDF,
Topic Maps etc.) are the unifying platform for the
tools and methodologies developed for different
application scenarios.



1 Motivation 

Imagine the following situation: As a consumer you 
are looking for information about a product. You 
may be interested in technical details, the price, de- 
livery conditions etc. For many products this type of 
information is available on web pages of companies. 
A situation that - despite the differences - is very 
much alike: As a scheduler of an automobile manu- 
facturer you are looking for subsidiary companies 
that are able to produce components or raw products 
for integration in a new production pipeline. Again 
you might profit from the information that is offered 
on web pages. An illustrative case are web pages of 
foundries. 

The problem today remains: although the information
searched for is available on WWW pages, automated
processes ('agents') will have problems finding and
extracting it.

This motivates the vision of the 'Semantic Web' as 
expressed by Tim Berners-Lee: 'Semantic Web - a 
web of data that can be processed directly or indi- 
rectly by machines' 5 . 

The Semantic Web of the future will very likely be 
based on direct authoring pjj. That means future 
documents will contain metadata, semantic tagging 
will be employed to make intra-document relations 
explicit, topic maps and other technologies will be 
used to express semantic relations between docu- 
ments (inter-document relations). In other words: 
making semantics and pragmatics of documents ex- 
plicit via tagging will be an integral part of the doc- 
ument creation process. 

In the current situation we have a multitude of exist- 
ing web pages with valuable contents, far too many 
to be manually augmented and transformed into Se- 
mantic Web documents. We therefore suggest both 
automatic and semi-automatic augmentation of doc- 
uments. 

We are developing tools and methodologies based on
NLP techniques, text technology and knowledge
representation for the transformation of existing
documents into Semantic Web documents (see Fig. 1).

In the following we suggest distinguishing two types
of documents. On the one hand there are enriched
documents: these are documents that originate from
the web or other sources. They undergo document
analysis and the results of the analysis are directly
integrated into the document using XML markup.
On the other hand there are transformed documents
that are composed of explicitly understood pieces
of information extracted from other documents.
On the surface enriched documents may have the
same 'look and feel' as simple HTML pages, i.e., at
first glance the user may not recognize any difference.
The added value of the explicit enrichment with
structural and semantic markup becomes apparent 
when automated processes (agents) are used. En- 
riched documents are suited for intelligent searches, 
querying, flexible recombination etc. 
The case of document transformation comes into play 
when pieces of information are reassembled and of- 
fered to the user in a uniform way as a document. 
As an example you can think of a rated overview 
page that is distilled from a collection of pages, e.g., 
a structured summary with the results of a compara- 
tive search on a collection of web pages from different 
manufacturers of a product. 

A general point to stress: from our point of view the
terms 'semi-structured documents' or 'unstructured
documents' that are sometimes used are misleading.
Documents are generally highly structured. The problem
is not a lack of structure; the problem is that the
structure is not made explicit in traditional documents.
Human readers are in most cases able to uncover those
implicit structures easily. Thus the major challenge for
the automatic conversion of traditional (web) documents
into Semantic Web documents is to uncover structure and
contents and to mark them up explicitly. This will then
allow processing and interpretation of the documents by
machines.

This paper is organized as follows: first we give a de- 
tailed description of our view of Semantic Web doc- 
uments. Then we describe encoding of information 
on web pages and mechanisms for transformation of 
implicit information into explicit data, followed by 
a realistic application scenario. Finally we discuss 
some of the open research issues. 



[Figure 1 (diagram). Input: web pages with implicit information
encoded as continuous text, structures (e.g., table, list) and
images. Resources: DTD, XML Schema, ontology, domain specific
resources (e.g., taxonomy of industry), language specific
resources (e.g., lexicon, grammar, abbreviations). Methods: web
page classification, (web page) structure analysis, natural
language processing (POS tagger, syntactic analysis, semantic
tagger, case frame analysis). Output: Semantic Web documents
with explicit markup (logical structures, semantic
classification, metadata).]

Figure 1: Towards the Semantic Web.

2 Towards Semantic Web Documents

The WWW is a fast growing source of heterogeneous 
information. For web document analysis in general, 
the analysis of text will have to be complemented 
by the analysis of other WWW media types: image 
analysis, video interpretation, voice processing etc. 
In addition, cross-media references and hypermedia 
structures need proper treatment. 
Natural language analysis of textual parts of web doc- 
uments is no different from 'normal' text analysis. 
For a given complex application in web document 
analysis we found it fruitful to classify its subtasks 
into the following three categories: 



• subtasks that are primarily WWW specific,

• subtasks that are specific to the application,

• subtasks that are relevant to all NLP ap- 
proaches. 

WWW specific subtasks can be classified as being 
part of the preprocessing stage. Preprocessing in this 
sense comprises all those operations that eventually 
result in a text document in the input format ex- 
pected by the linguistic tools. In other words, aspects 
of the source document that are irrelevant or distract- 
ing for linguistic processing will be abstracted away 
during preprocessing and the resulting document will 
be in a canonical format. 

If source documents already contain appropriate 
metadata some subtasks of preprocessing are reduced 
to looking up the values of metadata attributes. For 
now, preprocessing will in many cases include at- 
tempts at automatic language identification, domain 
classification or hyperlink tracing. 

2.1 The Power of Markup 

XML - and its precursor SGML - offers a formalism
to annotate pieces of (natural language) text. To
be more precise, if a piece of text is (as a simple
first approximation) seen as a sequence of characters
(alphabetic and whitespace characters), then XML
allows arbitrary markup to be associated with arbitrary
subsequences of contiguous characters. Many linguistic
units of interest are represented by strings of
contiguous characters (e.g., words, phrases, clauses etc.).
It is a straightforward idea to use XML to encode
information about such a substring of a text interpreted
as a meaningful linguistic unit and to associate this
information directly with the occurrence of the unit in
the text. The basic idea of annotation is further
supported by XML's well-formedness requirement, i.e.,
XML elements have to be properly nested. This is fully
concordant with standard linguistic practice: complex
structures are made up from simpler structures covering
substrings of the full string in a nested way. The
enrichment of documents is based on this ability to
associate information directly with the respective
span of text.
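
The nesting idea can be illustrated with a minimal Python sketch based on the standard library module xml.etree.ElementTree. The tag names N, NR and ABBR are the ones used in the POS tagging example of this paper; the phrase-level tag NP, the attribute TYPE="measure" and the helper annotate are assumptions made only for illustration and are not part of XDOC.

```python
import xml.etree.ElementTree as ET

def annotate(tag, children, **attrs):
    """Wrap a list of strings and/or elements in a new element (hypothetical helper)."""
    elem = ET.Element(tag, attrs)
    last = None
    for child in children:
        if isinstance(child, str):
            # Text before the first child element is .text, later text is a .tail.
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            elem.append(child)
            last = child
    return elem

# Word-level annotations first, then a phrase-level annotation nested around them.
noun = annotate("N", ["Breite"])
number = annotate("NR", ["2,68"])
unit = annotate("ABBR", ["m"])
phrase = annotate("NP", [noun, " ", number, unit], TYPE="measure")

xml_text = ET.tostring(phrase, encoding="unicode")
plain_text = "".join(phrase.itertext())
print(xml_text)
print(plain_text)
```

Stripping all tags returns the original character sequence unchanged, which is the property that lets annotations of different granularity be layered over the same span of text.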



3 Information Encoding on 
WWW Pages 

The starting point of our work is web pages as they
are found today. In the following we report on
results from corpus based case studies.



3.1 Corpus Based Case Studies 

We employ a corpus based approach. Investigations
start with the collection of a corpus of representative
documents.

Then a number of issues are systematically investi- 
gated: 

• What are the typical structure and contents of 
document instances in the corpus? 

• What information is most likely of interest for 
which type of users? 

• How can this information be located and ex- 
tracted? 

• What are characteristics of the source docu- 
ments that may make this task easier or more 
complicated? 

• What aspects can be generalized and abstracted 
from the specific case? 

3.2 Variations in Information Presentations

When the focus is on contents, not on surface
appearance, then the notion of 'paraphrase' is no longer
restricted to linguistic units. Our analyses of WWW
pages revealed that there are many cases where more or
less the same information can be conveyed both in a
number of linguistic variations and in different
non-linguistic formats ('extended paraphrases').
As an example we take the following excerpt from a
web page of a garage manufacturer.

Example 1 Excerpt from Web Page. 
Wunschbox-Garagen sind als Typ S mit einer
Breite von 2,68m, als Typ N (Breite 2,85m) 
und als Typ B (Breite 2,98m) lieferbar. Alle 
Garagen haben eine Hoehe von 2,46m. 

The phrasal pattern underlying the first sentence 

<product> is_available_as <enumeration of type info>

is found in variations like the following:

<product> (pl) sind als <enumeration> lieferbar.

<product> (sg) ist als <enumeration> erhältlich.

<product> (sg) gibt es als <enumeration>. ...

Type info in turn is given according to patterns
like:

<enumeration of type info> ==

<type nr1> with <feature> of <value1>,

<type nr2> with <feature> of <value2>,

<type nrn> with <feature> of <valuen>
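
As a rough illustration of how such patterns could be operationalized, the following Python sketch matches the first sentence of Example 1 with two regular expressions. It is a simplification under stated assumptions: the regular expressions FRAME and ITEM and the function extract are hypothetical, cover only the plural 'sind als ... lieferbar' variant and the two item forms occurring in the excerpt, and stand in for what would be grammar rules in a parser.

```python
import re

SENTENCE = ("Wunschbox-Garagen sind als Typ S mit einer Breite von 2,68m, "
            "als Typ N (Breite 2,85m) und als Typ B (Breite 2,98m) lieferbar.")

# <product> (pl) sind als <enumeration> lieferbar.
FRAME = re.compile(r"^(?P<product>\S+) sind (?P<enumeration>als .+) lieferbar\.$")

# One enumeration item, in either of the two surface forms of the excerpt:
#   "als Typ X mit einer <feature> von <value>"  or  "als Typ X (<feature> <value>)"
ITEM = re.compile(
    r"als Typ (?P<type>\w+)\s*"
    r"(?:mit einer (?P<f1>\w+) von (?P<v1>[\d,]+m)"
    r"|\((?P<f2>\w+) (?P<v2>[\d,]+m)\))"
)

def extract(sentence):
    """Return the product and its (type, feature, value) items, or None if no match."""
    frame = FRAME.match(sentence)
    if frame is None:
        return None
    items = []
    for m in ITEM.finditer(frame.group("enumeration")):
        items.append({
            "type": m.group("type"),
            "feature": m.group("f1") or m.group("f2"),
            "value": m.group("v1") or m.group("v2"),
        })
    return {"product": frame.group("product"), "types": items}

info = extract(SENTENCE)
for item in info["types"]:
    print(info["product"], item["type"], item["feature"], item["value"])
```

Both surface forms of an item are mapped to the same (type, feature, value) record, which is the sense in which they are paraphrases of each other.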

Note that the second sentence of Example 1 needs
contextual interpretation because its literal meaning
in isolation would be the universally quantified
assertion that 'All garages have a height of 2,46m' and not
the contextually restricted 'All garages of types S, N
and B of this manufacturer have a height of 2,46m'.
Essentially the same information could as well be
given in a variety of tabular formats.

Table 1: Tabular Information Presentation.
Garage types:

type   width   height
S      2,68    2,46
N      2,85    2,46
B      2,98    2,46

The third column of Table 1 (height) could be omitted
and replaced by the sentence: All (our) garages have a
height of 2,46m.

In general, combinations of linguistic units and ta- 
bles are possible, e.g., like in All data of the following 
table refer to garage type N. 



3.3 Interaction Between Linguistic 
Structures and Source Document 
Markup 

There are interactions between list structures in 
HTML (and XHTML) and linguistic units that have 
to be accounted for in linguistic analysis of web pages. 
A simple case is the use of lists for enumerating con- 
cepts like in the following example. 

Example 2 Enumerating Concepts. 

<p>Die wichtigsten Branchen sind: <ul> 
<li>Formen- und Werkzeugbau</li>
<li>Eisenbahnwesen</li> </ul> 

Please note that even such simple structures do need 
proper treatment of coordination and truncation and 
that contextual interpretation is obligatory in or- 
der to correctly infer semantic relations between list 
heading and list items. 

Very often a partial sentence and an HTML list of 
sentential complements or other phrases interact like 
in the following example: 

Example 3 Interaction Between Sentence and List 
Structure. 

Wir produzieren maschinen- und 
handgeformten Grau- und Sphaeroguss 
<ul> 

<li>ca. 25.000 t Jahresproduktion 
<li>mit mehr als 4.000 lebenden Modellen
<li>in mittleren und groesseren Serien 
<li>Handformguss bis 800 kg Stueckgewicht 
</ul> 

Here semantic interpretation unavoidably needs so- 
phisticated techniques for the interaction between the 
list items as phrasal type syntactic structures and the 
partial case frames created by the partial sentential 
structures as list headings. In some cases the list item 
is a full syntactic complement, in others the relation 
between item and heading is not structural but only 
on semantic grounds. 
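
Before any linguistic interpretation can take place, the list heading and the list items have to be brought together. The following Python sketch does this for Example 3 with the standard library module html.parser; the class ListCollector and the function normalize are hypothetical names, and the sketch only handles a single, non-nested list. Deciding whether an item is a syntactic complement of the heading or only semantically related is left to the subsequent analysis.

```python
from html.parser import HTMLParser

class ListCollector(HTMLParser):
    """Collect the text before a <ul> (the heading) and the text of each <li>."""

    def __init__(self):
        super().__init__()
        self.heading = []
        self.items = []
        self.in_list = False

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.in_list = True
        elif tag == "li" and self.in_list:
            # A new item starts even if the previous <li> was never closed.
            self.items.append([])

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False

    def handle_data(self, data):
        if self.in_list:
            if self.items:
                self.items[-1].append(data)
        else:
            self.heading.append(data)

def normalize(parts):
    """Join text fragments and collapse whitespace."""
    return " ".join("".join(parts).split())

SOURCE = """Wir produzieren maschinen- und
handgeformten Grau- und Sphaeroguss
<ul>
<li>ca. 25.000 t Jahresproduktion
<li>mit mehr als 4.000 lebenden Modellen
<li>in mittleren und groesseren Serien
<li>Handformguss bis 800 kg Stueckgewicht
</ul>"""

parser = ListCollector()
parser.feed(SOURCE)
parser.close()
heading = normalize(parser.heading)
pairs = [(heading, normalize(item)) for item in parser.items]
for head, item in pairs:
    print(head, "|", item)
```

Each (heading, item) pair is a candidate input for case frame analysis: the partial sentence supplies the frame, the item a possible filler.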

Frequently, when processing realistic documents,
deficiencies of the source material have to be accounted
for. An example: the analysis of the HTML sources
of foundry web pages revealed that HTML was sometimes
misused in the sense that the intended layout
was not created by the appropriate tagging (which
would make it easy to recover the intentions) but by
misusing other tags to create a certain surface
appearance. Such deficient structures create a problem
for automatic analysis.
Examples of HTML misuse include: 

• creation of a frame-like layout using tables, 

• creation of a list-like layout with <P> and
<BR> tags (see Example 4).

Example 4 Misuse of Paragraphs. 

Folgende max. Abmessungen sind moeglich: 
<p> - bis zu 14.000 mm Laenge<br> - bis zu
6.000 mm Durchmesser </p> 
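
One conceivable repair step for this kind of misuse is to detect paragraphs whose <br>-separated segments all start with a dash and to re-emit them as a proper list. The Python sketch below illustrates this heuristic on Example 4; the functions recover_pseudo_list and to_ul and the dash criterion are assumptions for illustration, not a description of an existing tool.

```python
import re

SOURCE = ("Folgende max. Abmessungen sind moeglich: "
          "<p> - bis zu 14.000 mm Laenge<br> - bis zu "
          "6.000 mm Durchmesser </p>")

def recover_pseudo_list(html):
    """Return (heading, items) if a <p> block looks like a dash-marked list, else None."""
    m = re.search(r"(?is)^(?P<head>.*?)<p>(?P<body>.*?)</p>", html)
    if m is None:
        return None
    segments = [s.strip() for s in re.split(r"(?i)<br\s*/?>", m.group("body"))]
    # Heuristic: at least two segments, and every segment starts with "- ".
    if len(segments) < 2 or not all(s.startswith("- ") for s in segments):
        return None
    return m.group("head").strip(), [s[2:].strip() for s in segments]

def to_ul(heading, items):
    """Re-emit the recovered structure with appropriate list markup."""
    lis = "".join(f"<li>{item}</li>" for item in items)
    return f"<p>{heading}</p><ul>{lis}</ul>"

heading, items = recover_pseudo_list(SOURCE)
print(to_ul(heading, items))
```

After this normalization the pseudo-list can be handled by the same machinery as a genuine HTML list.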

3.4 Tools and Resources 

In spite of the specific aspects discussed above, analysis
of textual (parts of) web pages has a lot in common
with document processing in general. We therefore
employ the XDOC 1 toolbox for this task and
combine it with web page specific modules.

Methods 

For the analysis of information from web pages we 
need different tools and resources. The tools can be 
divided into: 

• Preprocessing tools like raw text extraction
('HTML cleaner') and a collector of all web pages
of a company, or link tracing tools.

• Interpretation of HTML structures: what is the 
semantics behind HTML tags? 

• Linguistic tools for the semantic interpretation 
of linguistic structures, like sentences or phrases. 

1 XDOC stands for XML based document processing.

The WWW page preprocessing tools are not directly
relevant for the analysis of implicit information; these
tools only collect and prepare the web pages. Relevant
for the analysis of implicit information is the
interpretation of the internal structure of web pages:
Which pieces of information are embedded in which
HTML structures? Are they relevant for the semantic
interpretation of the contents inside the HTML
structure? The main focus is on the analysis of
linguistic structures, because inside HTML tags (e.g.,
tables or list structures etc.) we can find continuous
text as whole sentences, or on a lower level phrases
or simple lists of specific identifiers, e.g., nouns or
other tokens. For this task we use our document suite
XDOC - a collection of linguistic tools (see @] for a 
detailed description of the functions inside XDOC). 
In all functions of XDOC the results of processing 
are encoded with XML tags delimiting the respective 
piece of text. The information conveyed by the tag 
name is enriched with XML attributes and their re- 
spective values. 

In the following we give a short description of the
separate functions for the analysis of web pages.
Examples of the application of these functions
are presented in section 4.

• Part-of-Speech (POS) Tagger: The assignment
of part-of-speech information to a token - POS 
tagging for short - is not only a preparatory step 
for parsing. The information gained about a doc- 
ument by POS tagging and evaluating its results 
is valuable in its own right. We use a morphol- 
ogy based approach to POS tagging (cf. 0] for 
details). 

• Syntactic Parsing: For syntactic parsing we ap- 
ply a chart parser based on context-free grammar 
rules augmented with feature structures. 

• Semantic Tagger: For semantic tagging we ap- 
ply a semantic lexicon. This lexicon contains the 
semantic interpretation of a token and a case 
frame combined with the syntactic valency re- 
quirements. Similar to POS tagging, the tokens 
are annotated with their meaning and a classifi- 
cation in semantic categories like, e.g., concepts 
and relations. 

• Case Frame Analysis: As a result of case frame 
analysis of a token we obtain details about the 
type of recognized concepts (resolving multiple 
interpretations) and possible relations to other 
concepts. 
• Semantic Interpretation of Syntactic Structures
(SISS): Another step to analyze the relations be- 
tween tokens can be the interpretation of the 
specific syntactic structure of a phrase or sen- 
tence. We exploit the syntactic structure of do- 
main specific sublanguages to uncover the se- 
mantic relations between related tokens. 

Resources 

The resources vary depending on the tools used (like 
grammars, abbreviation lexicon etc.). For the anal- 
ysis we need domain resources, e.g., specific tax- 
onomies of the domain, and document specific re- 
sources. The document specific resources describe 
the characteristics of the sublanguage inside the web 
pages (e.g., which technical terms are used, what are 
the syntactic types of phrases etc.). 

3.5 Web Page Classification 

For efficient processing of web pages that contain
relevant information, a simple pre-classification
of web pages was developed:

• Information Pages: 

Information pages contain continuous text. The
information on these pages may be structured in
tables or be given as mixed information in the
form of continuous text and tables or enumerations.

• Lead Pages: 

Lead pages contain links to other web pages
about a single topic (or a single company). On
these pages more links to internal web pages than
to external web pages are found.

• Overview Pages or Portals: 

Overview Pages contain both text and links to 
external web pages. Here links to providers with 
a similar product range or to related topics are 
found. 

This classification was chosen because of its appli- 
cability to other domains. Web pages of other in- 
dustrial sectors are organized in the same manner, 



for example in producing industry and online deal- 
ers, e.g., building industry (doors, windows, garages, 
real estate), insurances, automotive industry. The 
classification of web pages into information, lead and 
overview pages is sufficient for information extraction 
with regard to the creation of company profiles. 
Parameters for an automatic classification of web
pages are the proportions of text segments and
hyperlinks on a web page, for example, the number
of internal links (e.g., one topic, company, or
domain), external links (e.g., other topics, companies
or domains), tokens, and pictures. In evaluating
links the local directory structure is taken into account.
By specifying criteria (e.g., number of internal links,
the ratio of internal to external links, the ratio of
tokens to pictures etc.) the user may influence the
classification. Moreover, web page classification is
influenced by layout based structures (frames,
scripts, controlling elements like buttons).
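
How such parameters might be combined is sketched below in Python. This is a heuristic illustration only: the class PageStats, the function classify, the two thresholds min_tokens_per_link and internal_ratio and their default values are all assumptions chosen for the example, and the host names are placeholders. The thresholds play the role of the user-specified criteria mentioned above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class PageStats(HTMLParser):
    """Count internal links, external links, tokens and pictures of one page."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.internal = self.external = self.tokens = self.pictures = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            host = urlparse(attrs["href"]).netloc
            # Relative links have no host and count as internal.
            if host and host != self.own_host:
                self.external += 1
            else:
                self.internal += 1
        elif tag == "img":
            self.pictures += 1

    def handle_data(self, data):
        self.tokens += len(data.split())

def classify(html, own_host, min_tokens_per_link=20, internal_ratio=0.5):
    """Return 'information', 'lead' or 'overview' (thresholds are assumed values)."""
    stats = PageStats(own_host)
    stats.feed(html)
    stats.close()
    links = stats.internal + stats.external
    if links == 0 or stats.tokens / links >= min_tokens_per_link:
        return "information"        # mostly continuous text
    if stats.internal / links > internal_ratio:
        return "lead"               # mostly links within the same site
    return "overview"               # mostly links to other sites

lead_page = ('<a href="/produkte.html">Produkte</a>'
             '<a href="/werk.html">Werk</a>'
             '<a href="http://partner.example/">Partner</a>')
print(classify(lead_page, "foundry.example"))
```

The sketch ignores the picture count and layout based structures; these would enter as further criteria in the same way.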

4 An Application Scenario 

This section is a case study about the creation of
casting specific company profiles. There are about
300 German foundries and they are present on the
WWW with one or more web pages. A possible scenario
is the following: a product cannot be produced
completely within one company. In this case company
profiles can be used for choosing a supplier company.
Companies use the WWW as a kind of 'yellow pages'
in order to get information about potential suppliers.
A company profile includes a variety of information
about a specific company. This information comprises
product related data (e.g., size and weight
of the casting, material, moulding processes and
quality assurance or certificates) and company
related data (e.g., single-piece work, small, medium,
and mass production). In addition, the location of
the potential supplier is important in order to reduce
transportation costs.

Company profiles are applicable in two ways. First, 
existing company web pages can be enhanced by 
making implicitly available information explicit 
via semantic annotation. Second, the proposed 
semantic annotation of documents is usable in direct 
authoring of future company web pages.



4.1 Creation of Company Profiles 

For the user it is important to know which prod- 
ucts (e.g., boxes, engine blocks, cylinder heads, or 
axles) for which industrial sectors (e.g., motor indus- 
try, wind power industry, machine building industry 
etc.) are produced by the company in question. This 
information allows inferences with respect to quality 
requirements fulfilled by the company during previ- 
ous production processes. Company profiles also con- 
tain data like addresses and additional contact infor- 
mation. 

About 60 foundry specific web pages have been ana- 
lyzed with respect to data required for company pro- 
files. A first result is the following XML-DTD: 

Example 5 DTD for a Foundry Company Profile. 

<!DOCTYPE profile [
<!ELEMENT profile (foundry)+>
<!ELEMENT foundry (name, specifics)>
<!ELEMENT name (f_name, contact, address)>
<!ELEMENT contact (tel, fax*, email*, http)>
<!ELEMENT address (street, city, zip)>
<!ELEMENT specifics (scope+, production, quality*)>
<!ELEMENT scope (material, (weight*, dimension*))>
<!ELEMENT dimension (#PCDATA)>
<!ATTLIST dimension for_what (mould | dim) "mould">
<!ELEMENT production (customer*, product*, i-sector*)>
<!ELEMENT quality (#PCDATA)> ... ]>

The DTD is based on web page analyses and gives 
all details of a company profile for foundries. Not all 
elements from the DTD are used by each company 
profile. This kind of DTD is used for presentation of 
contents. Parts of the DTD (e.g., address informa- 
tion, measures) can be used for analysis of web pages 
with our NLP tools. 

4.2 Instantiation of Elements 

In this section we present some examples for the 
recognition of information which is needed for the 
creation of a company profile. As an illustrative case 
we again take profiles of casting companies. 
In casting a lot of measurements, like mould, weight 
etc. are relevant. For the detection of measurement 
information we use the syntactic parser of XDOC. 
The grammar of the chart parser was extended with 
rules, which describe structures like: 2500 x 1^00 x 



600 or 1000 x 800 x 350 / 350 mm. A general pattern 
for this can be described by the following rules: 

MS-ENTRY: number measure (optional) 
3D-MS-ENTRY: MS-ENTRY x MS-ENTRY 
x MS-ENTRY 

and for foundry specific dimensions: 

3D-MS-ENTRY-C: MS-ENTRY x MS- 
ENTRY x MS-ENTRY / MS-ENTRY 

Measuring units (like mm, kg etc.) are handled in a
part of the abbreviation lexicon. Example 6 shows
the results of POS tagging of the phrase:
'Kastenformat 500 x 600 x 150 / 150 mm'.

Example 6 Results of POS Tagging. 

<N>Kastenformat</N> <NR>500</NR> <ABBR>x</ABBR> <NR>600</NR> 
<ABBR>x</ABBR> <NR>150</NR> <ABBR>/</ABBR> <NR>150</NR> 
<ABBR>mm</ABBR> 

Numbers are tagged with NR, nouns with N. The let- 
ter 'x' - used as a multiplication operator - is in a 
first step handled like an abbreviation (tag ABBR). In 
the future we may work with a separate category for 
these specific symbols. With the rules from above 
the syntactic parser interprets the results of the POS 
tagger as the following structure: 

Example 7 Results of Syntactic Parsing. 

<3D-MS-ENTRY-C RULE="MEAS4">
<3D-MS-ENTRY RULE="MEAS3">
<MS-ENTRY RULE="MEAS2"><NR>1000</NR></MS-ENTRY>
<ABBR>x</ABBR>
<MS-ENTRY RULE="MEAS2"><NR>800</NR></MS-ENTRY>
<ABBR>x</ABBR>
<MS-ENTRY RULE="MEAS2"><NR>350</NR></MS-ENTRY>
</3D-MS-ENTRY>
<ABBR>/</ABBR>
<MS-ENTRY RULE="MEAS1"><NR>350</NR><ABBR>mm</ABBR>
</MS-ENTRY></3D-MS-ENTRY-C>
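
The interplay of tagging and measurement rules can be approximated by the following Python sketch. It is not the XDOC implementation: the morphology based tagger is replaced by a few token tests, the chart parser is replaced by a small hand-written recursive descent parser, and the unit list UNITS as well as all function names are assumptions. The nesting of 3D-MS-ENTRY inside 3D-MS-ENTRY-C follows the structure shown in Example 7.

```python
import re

UNITS = {"mm", "cm", "m", "kg", "t"}   # assumed excerpt of the abbreviation lexicon
OPERATORS = {"x", "/"}

def pos_tag(phrase):
    """Stand-in tagger: numbers -> NR, units and operators -> ABBR, everything else -> N."""
    tagged = []
    for token in phrase.split():
        if re.fullmatch(r"\d+(?:[.,]\d+)*", token):
            tagged.append(("NR", token))
        elif token in UNITS or token in OPERATORS:
            tagged.append(("ABBR", token))
        else:
            tagged.append(("N", token))
    return tagged

def parse_ms_entry(tokens, i):
    """MS-ENTRY: number measure(optional). Returns (node, next_index) or None."""
    if i >= len(tokens) or tokens[i][0] != "NR":
        return None
    node = {"rule": "MS-ENTRY", "number": tokens[i][1], "unit": None}
    i += 1
    if i < len(tokens) and tokens[i][0] == "ABBR" and tokens[i][1] in UNITS:
        node["unit"] = tokens[i][1]
        i += 1
    return node, i

def expect(tokens, i, symbol):
    """True if token i is the operator symbol (tagged ABBR)."""
    return i < len(tokens) and tokens[i] == ("ABBR", symbol)

def parse_3d(tokens, i):
    """3D-MS-ENTRY: MS-ENTRY x MS-ENTRY x MS-ENTRY."""
    children = []
    for k in range(3):
        if k > 0:
            if not expect(tokens, i, "x"):
                return None
            i += 1
        result = parse_ms_entry(tokens, i)
        if result is None:
            return None
        child, i = result
        children.append(child)
    return {"rule": "3D-MS-ENTRY", "children": children}, i

def parse_3d_c(tokens, i):
    """3D-MS-ENTRY-C: 3D-MS-ENTRY / MS-ENTRY."""
    result = parse_3d(tokens, i)
    if result is None:
        return None
    box, i = result
    if not expect(tokens, i, "/"):
        return None
    result = parse_ms_entry(tokens, i + 1)
    if result is None:
        return None
    extra, i = result
    return {"rule": "3D-MS-ENTRY-C", "children": [box, extra]}, i

tagged = pos_tag("Kastenformat 500 x 600 x 150 / 150 mm")
tree, end = parse_3d_c(tagged, 1)   # start after the leading noun
print(tagged)
print(tree)
```

Treating 'x' and '/' as ABBR tokens mirrors the interim solution described above; a separate category for such symbols would only change the test in expect.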

Now with the SISS approach we are able to assign
the different numbers in the phrase to the height, length,
width and diameter dimensions of a cylindrical
3D-object. The SISS approach splits the whole syntactic
structure into smaller structures and assigns a sense
to these structures. The possible assignments are
coded in a lexicon. For this example the assignments
are shown in Example 8.

Example 8 Excerpt from the Lexicon for SISS. 
<ASSIGNMENTS RULE="3D-MS-ENTRY-C">
<COMPONENT NAME="3D-MS-ENTRY">
<EXPAND>3-dimension</EXPAND></COMPONENT>
<COMPONENT NAME="ABBR">NIL</COMPONENT>
<COMPONENT NAME="MS-ENTRY">dimension-diameter</COMPONENT>
</ASSIGNMENTS>

<ASSIGNMENTS RULE="3D-MS-ENTRY">
<COMPONENT NAME="MS-ENTRY">
<EXPAND>dimension-height</EXPAND></COMPONENT>
<COMPONENT NAME="ABBR">NIL</COMPONENT>
<COMPONENT NAME="MS-ENTRY">
<EXPAND>dimension-length</EXPAND></COMPONENT>
<COMPONENT NAME="ABBR">NIL</COMPONENT>
<COMPONENT NAME="MS-ENTRY">
<EXPAND>dimension-width</EXPAND></COMPONENT>
</ASSIGNMENTS>

For each child in a structure the assignments define
an interpretation; if the child is itself a structure
that is separately described in the lexicon, it
is annotated with the tag EXPAND. The semantic
sense is given by the content of the EXPAND element,
e.g., dimension-length. A COMPONENT with
the content NIL means that this child will not be
interpreted.
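
A possible reading of this lookup procedure is sketched below in Python. The dictionary LEXICON transcribes the two ASSIGNMENTS entries of the lexicon excerpt; the tuple encoding (component name, sense, expand flag), the simplified parse tree TREE and the function interpret are assumptions made for illustration and are not the XDOC data structures.

```python
# Assignment lexicon transcribed from the excerpt above: one entry per child,
# given as (component name, sense, expand flag); sense None stands for NIL.
LEXICON = {
    "3D-MS-ENTRY-C": [("3D-MS-ENTRY", "3-dimension", True),
                      ("ABBR", None, False),
                      ("MS-ENTRY", "dimension-diameter", False)],
    "3D-MS-ENTRY": [("MS-ENTRY", "dimension-height", True),
                    ("ABBR", None, False),
                    ("MS-ENTRY", "dimension-length", True),
                    ("ABBR", None, False),
                    ("MS-ENTRY", "dimension-width", True)],
}

# Simplified parse tree of "1000 x 800 x 350 / 350 mm" as (name, children-or-text).
TREE = ("3D-MS-ENTRY-C", [
    ("3D-MS-ENTRY", [
        ("MS-ENTRY", "1000"), ("ABBR", "x"),
        ("MS-ENTRY", "800"), ("ABBR", "x"),
        ("MS-ENTRY", "350"),
    ]),
    ("ABBR", "/"),
    ("MS-ENTRY", "350 mm"),
])

def interpret(node):
    """Return a list of (sense, text) pairs for the interpreted parts below node."""
    name, children = node
    assignments = LEXICON[name]
    if len(children) != len(assignments):
        raise ValueError(f"{name}: structure does not match lexicon entry")
    result = []
    for (child_name, payload), (expected, sense, expand) in zip(children, assignments):
        if child_name != expected:
            raise ValueError(f"{name}: expected {expected}, found {child_name}")
        if sense is None:                      # NIL: child is not interpreted
            continue
        if expand and child_name in LEXICON:   # child has its own lexicon entry
            result.extend(interpret((child_name, payload)))
        else:
            result.append((sense, payload))
    return result

senses = interpret(TREE)
for sense, text in senses:
    print(sense, "=", text)
```

The sense '3-dimension' of the inner structure is not emitted here because the structure is expanded into its three components instead; keeping it as an additional grouping label would be an equally plausible design.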

Another method for the detection of semantic relations
between linguistic structures is XDOC's case
frame analysis. A case frame describes relations
between various syntactic structures, like a token (noun
or verb) and its linguistic complements in the form
of noun phrases or prepositional phrases. An example:
the German phrase 'Formanlagen fuer Grauguss'
can be semantically interpreted through the
analysis of the case frame of the token 'Formanlagen'
(see Example 9). The case frame contains a relation
named TECHNIQUE; the filler of this relation must be
of the semantic category process (described by the
tag ASSIGN-TO) and must have the syntactic structure of a
prepositional phrase with an accusative noun phrase and
the preposition 'fuer' (tag FORM). In our example the
prepositional phrase 'fuer Grauguss' is recognized as a
filler for the relation TECHNIQUE, because the token
'Grauguss' is from the semantic category PROCESS.

Example 9 Results of Case Frame Analysis. 

<CONCEPTS> <CONCEPT TYPE="tool">
<WORD>Formanlagen</WORD>
<DESC>Maschine</DESC>
<SLOTS>
<RELATION TYPE="TECHNIQUE">
<ASSIGN-TO>PROCESS</ASSIGN-TO>
<FORM>P(akk, fak, fuer)</FORM>
<CONTENT>fuer Grauguss</CONTENT>
</RELATION>

</SLOTS>
</CONCEPT>
<CONCEPT TYPE="process">
<WORD>Grauguss</WORD>
<DESC>Fertigungsprozess</DESC>
</CONCEPT> </CONCEPTS>
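
The slot filling step behind this result can be approximated as follows. The Python sketch is a strong simplification: the dictionary SEMANTIC_LEXICON contains only the two tokens of the example, its layout and the function fill_case_frame are hypothetical, prepositional phrases are assumed to be already recognized as (preposition, noun) pairs, and the case requirement (accusative) of the FORM specification is not checked.

```python
# Hypothetical excerpt of a semantic lexicon: semantic type, description and slots.
SEMANTIC_LEXICON = {
    "Formanlagen": {
        "type": "tool",
        "desc": "Maschine",
        "slots": [{"relation": "TECHNIQUE", "assign_to": "process", "prep": "fuer"}],
    },
    "Grauguss": {"type": "process", "desc": "Fertigungsprozess", "slots": []},
}

def fill_case_frame(head, prepositional_phrases):
    """Fill the slots of head from a list of (preposition, noun) pairs."""
    frame = SEMANTIC_LEXICON[head]
    fillers = {}
    for slot in frame["slots"]:
        for prep, noun in prepositional_phrases:
            entry = SEMANTIC_LEXICON.get(noun)
            # A filler needs the required preposition and the required semantic category.
            if prep == slot["prep"] and entry and entry["type"] == slot["assign_to"]:
                fillers[slot["relation"]] = f"{prep} {noun}"
                break
    return {"word": head, "type": frame["type"], "relations": fillers}

result = fill_case_frame("Formanlagen", [("fuer", "Grauguss")])
print(result)
```

A phrase with a different preposition, or with a noun of another semantic category, leaves the TECHNIQUE slot unfilled.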

5 Research Topics 

We report on work in progress. In the following
we will sketch some of the open questions to be in- 
vestigated in the near future. 

Treatment of Coordination To be applicable on 
a realistic scale, a toolbox for document processing 
needs generic solutions on all levels of the linguistic 
system (lexical, syntactic, semantic, discourse). 
It is well known that there are always interactions 
and interdependencies between different levels. An 
example: In our current work we investigate possi- 
ble solutions for the issues of coordinated structures. 
These structures need proper treatment on the lexical 
level (e.g., treatment of prefix and suffix truncation in 
POS tagging), syntactic level (e.g., grammar rules for 
adjective groups, noun groups and mixtures of both) 
and on the semantic level (e.g., rules to decide about 
a disjunctive vs. a conjunctive reading). 
Complex coordinated structures are relevant in virtu- 
ally all our technical and medical applications. Thus 
our solutions should clearly be generic and indepen- 
dent of the domain, but domain knowledge as a re- 
source may be needed for semantic interpretation. 
Coordinated structures are relevant for querying
documents as well. A case in point are phrases like
Klein- und Mittelserien (in English: small and
medium batch production). If you search for Kleinserien
with the help of a conventional search engine, you
probably will not find web pages with the phrase
Klein- und Mittelserien.
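
One conceivable treatment on the lexical level is to expand truncated conjuncts before indexing or querying. The Python sketch below illustrates a lexicon-driven heuristic: the truncated first part is combined with every suffix of the second conjunct, and the first combination licensed by the vocabulary is accepted. The vocabulary, the regular expression TRUNCATION and the function expand_truncation are assumptions for illustration; only prefix-preserving truncation with 'und' or 'oder' is covered.

```python
import re

# Assumed excerpt of a word list; a real system would use its full lexicon.
VOCABULARY = {"Kleinserien", "Mittelserien", "Grauguss", "Sphaeroguss"}

# "<stem>- und <full word>" or "<stem>- oder <full word>"
TRUNCATION = re.compile(r"\b(\w+)- (und|oder) (\w+)\b")

def expand_truncation(text, vocabulary=VOCABULARY):
    """Expand e.g. 'Klein- und Mittelserien' when the vocabulary licenses the result."""
    def repl(match):
        stem, conj, full = match.groups()
        # Try every proper suffix of the second conjunct as the shared head.
        for cut in range(1, len(full)):
            candidate = stem + full[cut:].lower()
            if candidate in vocabulary:
                return f"{candidate} {conj} {full}"
        return match.group(0)   # leave unresolved truncations untouched
    return TRUNCATION.sub(repl, text)

expanded = expand_truncation("Wir fertigen Klein- und Mittelserien.")
print(expanded)
print("Kleinserien" in expanded)
```

With the expanded form in the index, a plain string search for Kleinserien succeeds; truncations that the vocabulary cannot resolve are left as they are rather than guessed.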

Towards Generic Solutions NLP has both an 
analytic and a generative 'procedural' perspective in 
the sense that grammars are not only used to merely 
'describe' linguistic structures but to actually ana- 
lyze or generate them. The analogy for documents 
would be to not only use descriptions (DTDs, XML 
Schemata) to validate already marked up instances
but to employ descriptions both for analysis of yet 
unmarked documents as well as for the generation of 
documents that obey the rules of a schema. 
The move from DTDs to XML Schemata is a big
step forward and allows document contents and
structure to be modelled better. From the perspective
of not only describing but automatically analyzing
documents, further improvements should be envisaged.
Will it be possible to integrate information about
the automatic detection of document elements into
an extended document description framework? Such
a possibility would allow schemata for documents to be
configured declaratively; these could then be exploited
for the processing of raw documents that implicitly
follow the schema but do not have explicit markup
yet.

Limits of Markup Where are the limits for en- 
riching documents with XML markup? The basic 
idea of delimiting spans of text in a document within 
an opening and a closing tag is simple and convincing. 
But: will simple well formed markup really suffice? 
When will we unavoidably have to work with concur- 
rent markup [2]? When will we have to give up the 
adjacency requirement for text spans and will have 
to deal with discontinuous structures as well? 

6 Conclusion 

The vision of the Semantic Web is a stimulat- 
ing driving force for research in text technology, 
Computational Linguistics (CL) and knowledge 
representation. The basic issues are actually not 
new, but they now receive a much broader attention 
than before. These different approaches are comple- 
mentary, combining them will result in synergetic 
effects. 

The focus of the SGML/XML approaches to doc- 
uments has been on providing means to describe 
document structures (i.e., DTDs, schemata) and on 
tools to validate already marked up document in- 
stances with respect to an abstract description (i.e., 
parsers, validators). From an NLP perspective the 
weakest point in SGML/XML is that the content of
terminal elements (i.e., elements with text only) is
simply uninterpreted PCDATA, i.e., strings without
their internal structure made explicit. Although in
principle markup can go down to the granularity of 
single characters there is a 'traditional' bias in the 
markup community towards macro level structuring 
with paragraphs as the typical grain size. 
On the other hand, many - but definitely not all 
- techniques and tools from CL and NLP have 
their focus on the word, phrase or sentence level. 
Why not combine macro level structuring with 
lexical, syntactic and semantic analysis of otherwise 
uninterpreted PCDATA? 

The third field - knowledge representation - is
indispensable because semantics needs grounding
in formal KR structures. Ontologies (i.e., basic
vocabularies for expressing meaningful relations)
and a worked out taxonomy of interrelated domain
concepts are the backbone of any semantic account
of document contents.